
Conversation

Kump3r commented Sep 25, 2025

Rendered
Previously discussed as well, but I think it makes sense to collect comments on this again, due to industry standards and an overall need for this. Should we open a discussion to collect interest/opinions of the community as well?

"status": "healthy/unhealthy",
"details": {
"database": "healthy/unhealthy",
"workers": "healthy/unhealthy",


"workers" as a single entry does not convey much meaningful information IMO. A list of the status of each worker might be more useful. Also, the semantics of the general "status" should be clarified. What is considered a healthy instance ?

Kump3r (Author) replied:

Yeah, I hadn't thought about it that way, thanks for the input. Extending it a bit further with our 1on1 discussion, we might not even need information about the workers or database, but rather whether the API is working and whether workloads are schedulable. So not looking at specific interfaces or services, but more or less: is the ATC working, and if so, can it schedule workloads? An example that comes to mind is a systematic/periodic one-off build which is tracked by this backend and reports a simple "run-jobs: healthy". Should the API not be reachable, the endpoint will be down anyway. So in that case it would look like:

"status": "healthy",
"run-jobs": "healthy"

Should it fail to run jobs within a certain time frame, the status will change to unhealthy.
Does that more or less sum it up, or am I missing something?
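
For completeness, the failing case would then be the same shape with the values flipped (a sketch):

{
    "status": "unhealthy",
    "run-jobs": "unhealthy"
}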

Kump3r (Author) commented Sep 26, 2025

I had an offline discussion with a stakeholder who also raised a valid question that can be added to the document:

Q: "How is Concourse on K8s determining the state of the pods?"
A: There are liveness and readiness probes defined in the chart, which make an HTTP request to the /api/v1/info endpoint.
The idea of the change would be to have a more dedicated endpoint that builds on that static check by also taking into account status that can change dynamically.

Thanks for the question, I will also add this to the document once a few more questions are gathered!
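
For reference, those probes are roughly of the following shape (a sketch in Kubernetes' JSON manifest form; the port and thresholds here are assumptions, not the chart's actual values). The proposed endpoint would essentially be an alternative path that also reflects dynamic status:

{
    "livenessProbe": {
        "httpGet": { "path": "/api/v1/info", "port": 8080 },
        "periodSeconds": 15,
        "failureThreshold": 5
    },
    "readinessProbe": {
        "httpGet": { "path": "/api/v1/info", "port": 8080 }
    }
}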

Kump3r (Author) commented Sep 26, 2025

Also part of an offline follow-up with another user:

"You can think of Google status health" to have a more red/green status pointing towards potential problems with the application.

I think a GUI change is a bit out of scope for this RFC, although this RFC would let the UI be extended that way easily, so it is worth writing it down as a possible future follow-up.

taylorsilva (Member) commented Sep 26, 2025

Totally for this. A lot of my questions are around implementation, which I see already written down in the POC PR concourse/concourse#4818. I think it would be nice for this RFC to define specifically what we want the Health JSON response to look like.

A Concourse web node is made up of a bunch of micro-service-ish components. We could potentially display the health of all of these components (see components.go). There may be some exceptions in that file, but most of these components are run "globally" across one of the web nodes, based on the workload the web node is handling. They're load-balanced!

There are some services on the web node that are not load-balanced, like the TSA and API. Those are always running on all web nodes.

A detailed health response could look something like this, which I think would accurately describe the entire Concourse cluster:

{
    "status": "...",
    "workers": {
        "worker-1": {
            "baggageclaim": "...",
            "garden": "..."
        }
        ...
    },
    "web-nodes": {
        "web-1": {
            "api": "...",
            "tsa": "...",
            "db-connection": "...",
            ...
        }
        ...
    },
    "global-components": {
        "log-collector": "...",
        "lidar": "...",
        "secret-management": "...",
        "scheduler": "..."
        ...
    }
}

I wouldn't expect an initial PR to fully implement all of that though. I think this RFC could clearly define what we want the end goal to look like and then slowly work towards it through multiple PRs. WDYT?
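
As one possible reading of that incremental approach (my own sketch, not something taylorsilva spelled out), a first iteration could report only the pieces that are already easy to check and leave the rest out:

{
    "status": "healthy",
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "db-connection": "healthy"
        }
    }
}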

Kump3r (Author) commented Sep 27, 2025

I agree, I really like the idea and I am all for having an easy-to-reach status board for all of the components. One of the key questions that comes to mind is when the overall status should change to unhealthy: although each component has its share of work to do, if some of them flap or are unstable, that shouldn't mean the instance is not operational, but rather somewhat degraded. So, building on what you wrote, it would be great before closing the RFC to have the JSON response and the conditions that are a hard requirement for a healthy instance figured out. Thanks to all for the feedback, I like the overall direction of the discussions here. Once we have a couple more comments, I will add all the discussions to the document.
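
To make that distinction concrete, a hypothetical degraded response (the "degraded" value and the component names are assumptions, nothing settled in this thread) might look like:

{
    "status": "degraded",
    "global-components": {
        "scheduler": "healthy",
        "lidar": "unhealthy"
    },
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "db-connection": "healthy"
        }
    }
}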

DimitarKapashikov commented:

If you plan to reuse the same endpoint for Kubernetes health checks, you can introduce a parameter to differentiate between web and worker nodes. For example:

  • /health?component=web
  • /health?component=workers

It could also be extended to the pod level, such as:

  • /health?component=workers-1
  • /health?component=workers-n

This way, Kubernetes can identify and restart individual pods if they become unhealthy.
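
A scoped response for such a query (hypothetical shape, reusing the structure sketched earlier in the thread) could then return only the requested subtree, e.g. for /health?component=web:

{
    "status": "healthy",
    "web-nodes": {
        "web-1": {
            "api": "healthy",
            "tsa": "healthy",
            "db-connection": "healthy"
        }
    }
}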

linux-foundation-easycla bot commented Nov 4, 2025

CLA Signed. The committers listed above are authorized under a signed CLA.
